Reliability in Content Analysis: Some Common Misconceptions and Recommendations

Author

  • Klaus Krippendorff (University of Pennsylvania)
Abstract

In a recent article published in this journal, Lombard, Snyder-Duch, and Bracken (2002) surveyed 200 content analyses for their reporting of reliability tests; compared the virtues and drawbacks of five popular reliability measures; and proposed guidelines and standards for their use. Their discussion revealed that numerous misconceptions circulate in the content analysis literature regarding how these measures behave and can aid or deceive content analysts in their effort to ensure the reliability of their data. This paper proposes three conditions for statistical measures to serve as indices of the reliability of data and examines the mathematical structure and the behavior of the five coefficients discussed by the authors, plus two others. It compares common beliefs about these coefficients with what they actually do and concludes with alternative recommendations for testing reliability in content analysis and similar data-making efforts.

Disciplines: Communication | Social and Behavioral Sciences
This journal article is available at ScholarlyCommons: http://repository.upenn.edu/asc_papers/242
Manuscript published in Human Communication Research 30, 3: 411-433, 2004.

In a recent paper published in a special issue of Human Communication Research devoted to methodological topics (Vol. 28, No. 4), Lombard, Snyder-Duch, and Bracken (2002) presented their findings on how reliability was treated in 200 content analyses indexed in Communication Abstracts between 1994 and 1998. In essence, their results showed that only 69% of the articles report reliabilities. This amounts to no significant improvement in reliability concerns over earlier studies (e.g., Pasadeos et al., 1995; Riffe & Freitag, 1996). Lombard et al. attribute the failure of consistent reporting of the reliability of content analysis data to a lack of available guidelines, and they end up proposing such guidelines. Having come to their conclusions by content analytic means, Lombard et al. also report their own reliabilities, using not one but four indices for comparison: %-agreement; Scott's (1955) π (pi); Cohen's (1960) κ (kappa); and Krippendorff's (1970, 2004) α (alpha). Faulty software [1] initially led the authors to miscalculations, now corrected (Lombard et al., 2003). However, in their original article, the authors cite several common beliefs about these coefficients and make recommendations that I contend can seriously mislead content analysis researchers, thus prompting my corrective response. To put the discussion of the purpose of these indices into a larger perspective, I will have to go beyond the arguments presented in their article. Readers who might find the technical details tedious are invited to go to the conclusion, which is in the form of four recommendations.

The Conservative/Liberal Continuum

Lombard et al. report "general agreement (in the literature) that indices which do not account for chance agreement (%-agreement and Holsti's [1969] CR – actually Osgood's [1959, p. 44] index) are too liberal while those that do (π, κ, and α) are too conservative" (2002, p. 593). For liberal or "more lenient" coefficients, the authors recommend adopting higher critical values for accepting data as reliable than for conservative or "more stringent" ones (p. 600) – as if differences between these coefficients were merely a problem of locating them on a shared scale. Discussing reliability coefficients in terms of a conservative/liberal continuum is not widespread in the technical literature. It entered the writing on content analysis not so long ago. Neuendorf (2002) used this terminology, but only in passing. Before that, Potter and Levine-Donnerstein (1999, p. 287) cited Perreault and Leigh's (1989, p. 138) assessment of the chance-corrected κ as being "overly conservative" and "difficult to compare (with) ... Cronbach's (1951) alpha," for example – as if the comparison with a correlation coefficient mattered.

I contend that trying to understand diverse agreement coefficients by their numerical results alone, conceptually placing them on a conservative/liberal continuum, is seriously misleading. Statistical coefficients are mathematical functions. They apply to a collection of data (records, values, or numbers) and result in one numerical index intended to inform its users about something – here, about whether they can rely on their data. Differences among coefficients are due to responding to (a) different patterns in data and/or (b) the same patterns but in different ways. How these functions respond to which patterns of agreement, and how their numerical results relate to the risk of drawing false conclusions from unreliable data – not just the numbers they produce – must be understood before selecting one coefficient over another.
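To make concrete why a shared conservative-to-liberal scale cannot capture these differences, consider a minimal numerical sketch of my own (the two contingency tables and all proportions below are hypothetical, chosen only for illustration and not taken from Lombard et al.'s data): %-agreement and Scott's π can rank the same two coder pairs in opposite order, because π also responds to how the coders distribute their categories.

```python
# Hypothetical 2-by-2 proportions (a, b, c, d) for two coder pairs; a and d are
# the matching cells, b and c the mismatches. Illustrative numbers only.

def percent_agreement(a, b, c, d):
    """Raw agreement: the proportion of units on which the two coders match."""
    return a + d

def scott_pi(a, b, c, d):
    """Scott's pi: chance-corrects raw agreement using the coders' pooled marginals."""
    A_o = a + d
    p = (2 * a + b + c) / 2        # pooled proportion of category 0 across both coders
    A_e = p**2 + (1 - p)**2        # agreement expected if categories were assigned by chance
    return (A_o - A_e) / (1 - A_e)

pair_X = (0.40, 0.10, 0.10, 0.40)  # balanced category use, some disagreement
pair_Y = (0.85, 0.08, 0.07, 0.00)  # one category dominates; matches occur mostly on it

for name, cells in (("X", pair_X), ("Y", pair_Y)):
    print(name, round(percent_agreement(*cells), 3), round(scott_pi(*cells), 3))
# X: %-agreement 0.80, pi 0.60.  Y: %-agreement 0.85, pi about -0.08.
# %-agreement ranks Y above X; pi ranks X far above Y. The two indices are not
# points on one lenient-to-stringent scale; they respond to different properties
# of the same reliability data.
```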
Issues of Scale

Let me start with the ranges of the two broad classes of agreement coefficients, chance-corrected agreement and raw or %-agreement. While both kinds equal 1.000 or 100% when agreement is perfect and data are considered reliable, %-agreement is zero when absolutely no agreement is observed: when one coder's categories unfailingly differ from the categories used by the other, or disagreement is systematic and extreme. Extreme disagreement is statistically almost as unexpected as perfect agreement. It should not occur, however, when coders apply the same coding instruction to the same set of units of analysis and work independently of each other, as is required when generating data for testing reliability. Where the reliability of data is an issue, the worst situation is not when one coder looks over the shoulder of another coder and selects a non-matching category, but when coders do not understand what they are asked to interpret, categorize by throwing dice, or examine unlike units of analysis, causing research results that are indistinguishable from chance events. While zero %-agreement has no meaningful reliability interpretation, chance-corrected agreement coefficients, by contrast, become zero when coders' behavior bears no relation to the phenomena to be coded, leaving researchers clueless as to what their data mean. Thus, the scales of chance-corrected agreement coefficients are anchored at two points of meaningful reliability interpretation, zero and one, whereas %-like agreement indices are anchored at only one, 100%, which renders all deviations from 100% uninterpretable as far as data reliability is concerned. %-agreement has other undesirable properties: for example, it is limited to nominal data; it can compare only two coders [2]; and high %-agreement becomes progressively unlikely as more categories are available. I am suggesting that the convenience of calculating %-agreement, which is often cited as its advantage, cannot compensate for its meaninglessness.

Let me hasten to add that chance-correction is not a panacea either. Chance-corrected agreement coefficients do not form a uniform class. Benini (1901), Bennett, Alpert, and Goldstein (1954), Cohen (1960), Goodman and Kruskal (1954), Krippendorff (1970, 2004), and Scott (1955) build different corrections into their coefficients, thus measuring reliability on slightly different scales. Chance can mean different things. Discussing these coefficients in terms of being conservative (yielding lower values than expected) or liberal (yielding higher values than expected) glosses over their crucial mathematical differences and privileges an intuitive sense of the kind of magnitudes that are somehow considered acceptable. If it were merely an issue of striking a balance between conservative and liberal coefficients, it would be easy to follow statistical practices and modify larger coefficients by squaring them and smaller coefficients by applying the square root to them. However, neither transformation would alter what these mathematical functions actually measure, only the sizes of the intervals between 0 and 1. Lombard et al., by contrast, attempt to resolve their dilemma by recommending that content analysts use several reliability measures. In their own report, they use κ, "an index ... known to be conservative," but when κ measures below .700, they revert to %-agreement, "a liberal index," and accept data as reliable as long as the latter is above .900 (2002, p. 596). They give no empirical justification for their choice. I shall illustrate below the kind of data that would pass their criterion.
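To preview the kind of data at issue (the frequencies below are hypothetical and of my own choosing; the article's own illustration comes later), a pair of coders who both assign one category to about 94% of the units can exceed the .900 %-agreement fallback while κ remains far below .700:

```python
# Hypothetical 2-by-2 proportions: both coders use category 0 for 94% of units.
a, b, c, d = 0.90, 0.04, 0.04, 0.02   # a, d = matching cells; b, c = mismatches

A_o = a + d                            # raw %-agreement = 0.92, above the .900 fallback
pA, pB = a + b, a + c                  # each coder's proportion of category 0
A_e = pA * pB + (1 - pA) * (1 - pB)    # agreement expected from the coders' marginals
kappa = (A_o - A_e) / (1 - A_e)        # Cohen's kappa ~= 0.29, far below .700

print(round(A_o, 3), round(kappa, 3))
# Under the fallback rule these data would be accepted as reliable on the strength
# of %-agreement alone, although most of the observed matching is attributable to
# the sheer prevalence of one category.
```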
Relation Between Agreement and Reliability

To be clear, agreement is what we measure; reliability is what we wish to infer from it. In content analysis, reproducibility is arguably the most important interpretation of reliability (Krippendorff, 2004, p. 215). I am suggesting that an agreement coefficient can become an index of reliability only when:

(1) It is applied to proper reliability data. Such data result from duplicating the process of describing, categorizing, or measuring a sample of data obtained from the population of data whose reliability is in question. Typically, but not exclusively, duplications are achieved by employing two or more widely available coders or observers who, working independently of each other, apply the same coding instructions or recording devices to the same set of units of analysis.

(2) It treats units of analysis as separately describable or categorizable, without, however, presuming any knowledge about the correctness of their descriptions or categories. What matters, therefore, is not truths, correlations, subjectivity, or the predictability of one particular coder's use of categories from that of another coder, but agreements or disagreements among multiple descriptions generated by a coding procedure, regardless of who enacts that procedure. Reproducibility is about data making, not about coders. A coefficient for assessing the reliability of data must treat coders as interchangeable and count observable coder idiosyncrasies as disagreement (a point illustrated in the sketch at the end of this section).

(3) Its values correlate with the conditions under which one is willing to rely on imperfect data. The correlation between a measure of agreement and the rely-ability on data involves two kinds of inferences. Estimating the (dis)agreement in a population of data from the (dis)agreements observed and measured in a subsample of these data is an inductive step and a function of the number of coders involved and the proportion of units in the recoded data. Inferring the (un)reliability of data from the estimated (dis)agreements is an abductive step and justifiable mainly in terms of the (economic, social, or scientific) consequences of using imperfect data. An index of the degree of reliability must have at least two designated values: one to know when reliability is perfect, and the other to know when the conclusions drawn from imperfect data are valid by mere chance.

Note that (1) defines a precondition for measuring reliability. No single coefficient can determine whether coders are widely available, use the same instructions, work independently, and code identical units of analysis. Researchers must assure their peers or critics that the reliability data they generate satisfy these conditions. Many methodological problems in testing reliability stem from violating the requirement for coders to be truly independent, from coders being given coding instructions they cannot follow, or from applying these instructions to data that coders fail to understand. The two methodological problems considered here result from choosing inadequate measures of agreement – calling something a reliability coefficient does not make it so – and from applying indefensible decision criteria to their results. Since Lombard et al. discuss the relative merits of the above-mentioned measures, correctly citing widely published but disputable claims, I feel compelled to provide mathematical demonstrations of how these coefficients actually differ and whether they satisfy (2) and (3) above. Let me discuss several better-known candidates.
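Condition (2) can be made concrete with a small sketch (the table and marginal proportions below are hypothetical and of my own choosing): for two coders and dichotomous data, Cohen's κ derives its expected agreement from each coder's individual marginals, whereas Scott's π derives it from their pooled category use and thereby treats the coders as interchangeable. As a consequence, κ can never fall below π, and the gap widens with the coders' differing category preferences – precisely the coder idiosyncrasies that condition (2) says should count as disagreement.

```python
# Hypothetical reliability data for two coders, as 2-by-2 proportions.
# Coder A assigns category 0 to 60% of units; coder B assigns it to only 30%.
a, b, c, d = 0.30, 0.30, 0.00, 0.40

A_o = a + d                               # observed agreement = 0.70
pA, pB = a + b, a + c                     # each coder's proportion of category 0

# Cohen's kappa: expected agreement from the two coders' individual marginals.
Ae_kappa = pA * pB + (1 - pA) * (1 - pB)
kappa = (A_o - Ae_kappa) / (1 - Ae_kappa)   # ~= 0.444

# Scott's pi: expected agreement from the pooled marginals, treating the
# coders as interchangeable.
p = (pA + pB) / 2
Ae_pi = p**2 + (1 - p)**2
pi = (A_o - Ae_pi) / (1 - Ae_pi)            # ~= 0.394

print(round(kappa, 3), round(pi, 3))
# The two expected-agreement terms differ by (pA - pB)**2 / 2, which is why kappa
# exceeds pi here: kappa credits the coders' diverging category preferences
# instead of counting them as disagreement.
```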
A Comparison of Seven Agreement Coefficients

To begin, Lombard et al. are correct in discouraging the use of association, correlation, and consistency coefficients, including Cronbach's (1951) alpha, as indices of reliability in content analysis. Association measures respond to any deviation from chance contingencies between variables, and correlation measures moreover to deviations from linearity, whereas (2) stipulates that reliability must be indicated by measures of agreement among multiple descriptions. Although the authors do not report on how often content analysis researchers fail to realize this crucial difference and use inappropriate indices (I could cite numerous examples of such uses and even name explicit proponents of such practices), one cannot warn strongly enough against the use of correlation statistics in reliability tests. I agree with the authors' assessment of the inappropriateness of such coefficients and therefore need not consider them here. However, I take issue with their presentation of the differences among chance-corrected agreement coefficients. A crucial point is whether and how the population of data whose reliability is in question enters the mathematical form of a coefficient – that is, whether not only (2) but also (3) is satisfied. To illustrate the issues involved, I shall compare the five coefficients that Lombard et al. found to be most commonly used, plus Benini's (1901) β (beta) and Bennett, Alpert, and Goldstein's (1954) S, in their most elementary forms: for dichotomous data generated by two coders. In such a severely restricted but mathematically exceptionally transparent situation, reliability data can be represented by means of the familiar proportions a, b, c, and d of a 2-by-2 contingency table, shown in Figure 1. In this figure, a+d is the observed %-agreement, Ao; b+c is the observed %-disagreement; and its marginal sums show the proportions p of 0s and q = 1−p of 1s as used by the two coders A and B, respectively.

Figure 1. Generic 2-by-2 Contingency Table
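The figure itself did not survive in this copy; only its caption did. A rendering consistent with the verbal description is sketched below; the row/column arrangement, the assignment of b and c to the two off-diagonal cells, and the subscripted marginal labels p_A, q_A, p_B, q_B are my notation, not necessarily the original layout.

```latex
% A plausible rendering of Figure 1, reconstructed from the verbal description:
% rows hold coder A's categories, columns coder B's; a and d are the matching
% cells, b and c the mismatches; the margins give each coder's use of 0 and 1.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
\begin{array}{c|cc|c}
                        & \text{coder } B\colon 0 & \text{coder } B\colon 1 &               \\ \hline
\text{coder } A\colon 0 & a                       & b                       & p_A           \\
\text{coder } A\colon 1 & c                       & d                       & q_A = 1 - p_A \\ \hline
                        & p_B                     & q_B = 1 - p_B           & 1
\end{array}
\qquad
A_o = a + d, \qquad 1 - A_o = b + c .
\]
\end{document}
```

In this notation, the chance-corrected coefficients compared in what follows all start from the same observed agreement Ao = a+d and differ in the expected-agreement term that each subtracts from it.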
